Introduction

Matrix Multiplication

利用矩阵乘法介绍优化方案

MatrixMultiplication

采用 Python、Java、C 的运行时间不一样

Why is Python so slow and C so fast?

  • Python is interpreted.
  • C is compiled directly to machine code.
  • Java is compiled to byte-code, which is then interpreted and just-in-time (JIT) compiled to machine code.

i、j、k 循环调换位置后运行时间不一样

  • cache hits 和 cache misses

ijk1

ijk2

ijk3

ijk4

ijk5

Clang 优化

不一定 O3 优化比 O2 优化快,有时候 O2 快,有时候 O3 快

Parallel Loops

The cilk_for loop allows all iterations of the loop to execute in parallel.

Rule of Thumb Parallelize outer loops rather than inner loops.

进一步优化-重用数据(tiling)

  • Restructure the computation to reuse data in the cache as much as possible. (Cache misses are slow, and cache hits are fast.)
  • Try to make the most of the cache by reusing the data that’s already there.

Parallel divide-and-conquer

cilk_spawn
/*
The child function call is spawned, meaning it may execute in parallel with the parent caller.
*/
cilk_sync
/*
Control may not pass this point until all spawned children have returned.
*/

Compiler Vectorization(编译矢量化)

Many machines don’t support the newest set of vector instructions, however, so the compiler uses vector instructions conservatively by default

更多方法

  • Preprocessing
  • Matrix transposition
  • Data alignment
  • Memory-management optimizations
  • A clever algorithm for the base case that uses AVX intrinsic instructions explicitly

综上优化效果

time

results matching ""

    No results matching ""